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Abstract. We present here a new algorithm for the fast computation of Appoint 
correlation functions in large astronomical data sets. The algorithm is based on hd- 
trees which are decorated with cached sufficient statistics thus allowing for orders 
of magnitude speed-ups over the naive non-tree-based implementation of correla- 
^ , tion functions. We further discuss the use of controlled approximations within the 

■ computation which allows for further acceleration. In summary, our algorithm now 

£f>^ ' makes it possible to compute exact, all-pairs, measurements of the 2, 3 and 4-point 

ff) , correlation functions for cosmological data sets like the Sloan Digital Sky Survey 

£SJ ' (SDSS; York et al. 2000) and the next generation of Cosmic Microwave Background 

t-H ' experiments (see Szapudi et al. 2000). 

o : 
o : 

^ 1 Introduction 

Q ■ Correlation functions are some of the most widely used statistics within as- 

| trophysics (see Peebles 1980 for a extensive review). They are often used to 

quantify the clustering of objects in the universe (e.g. galaxies, quasars etc.) 
\ compared to a pure Poission process. More recently, they have also been used 

to measure fluctuations in the Cosmic Microwave Background (see Szapudi 
k>( | et al. 2000). On large scales, the higher-order correlation functions (3-point 

; i ■ and above) can be used to test several fundamental assumptions about the 

universe; for example, our hierarchical scenario for structure formation, the 



Gaussianity of the initial conditions as well as testing various models for the 
biasing between the luminous and dark matter. The reader is referred to Sza- 
pudi (2000), Szapudi et al. (1999a, b) and Scoccimarro (2000; and references 
therein) for an overview of the usefulness of TV-point correlation functions in 
constraining cosmological models. 

Over the coming decade, several new, massive cosmological surveys will 
become available to the astronomical community. In this new era, the qual- 
ity and quantity of data will warrant a more sophisticated analysis of the 
higher-order correlation functions of galaxies (and other objects) over the 
largest range of scales possible. Our ability to perform such studies will be 
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severely limited by the computational time needed to compute such functions 
and no-longer by the amount of data available. In this paper, we address this 
computational "bottle-neck" by outlining a new algorithm that uses innova- 
tive computer science to accelerate the computation of ./V-point correlation 
functions far beyond the naive 0(R N ) scaling law (where R is the number of 
objects in the dataset and N is the power of correlation function desired). 

The algorithm presented here was developed as part of the "Computa- 
tional AstroStatistics" collaboration (see Nichol et al. 2000) and is a member 
of a family of algorithms for a very general class of statistical computations, 
including nearest-neighbor methods, kernel density estimation, and cluster- 
ing. The work presented here was initially presented by Gray & Moore (2001) 
and will soon be discussed in a more substantial paper by Connolly et al. 
(2001). In this conference proceeding, we provide a brief review of Ad-trees 
(Section |J), a discussion of the use of fcd-trees in range searches (Section 
|3|), an overview of the development of a fast ./V-point correlation function 
code (Section ^) as well as presenting the concept of controlled approxima- 
tions in the calculation of the correlation function (Section ||). In Section 6, 
we provide preliminary results on the computation speed-up achieved with 
this algorithm and discuss future prospects for further advances in this field 
through the use of other tree structures. 

2 Review of M-trees 

Our fast iV-point correlation function algorithm is built upon the fcd-tree 
data structure which was introduced by Friedman et al. (1977). A fcd-tree 
is a way of organizing a set of datapoints in fc-dimensional space in such a 
way that once built, whenever a query arrives requesting a list all points in 
a neighborhood, the query can be answered quickly without needing to scan 
every single point. 

The root node of the Ad-tree owns all the data points. Each non-leaf-node 
has two children, defined by a splitting dimension ti-SplitDim and a splitting 
value ti.SplitValue. The two children divide their parent's data points between 
them, with the left child owning those data points that are strictly less than 
the splitting value in the splitting dimension, and the right child owning the 
remainder of the parent's data points: 



As an example, some of the nodes of a M-tree are illustrated in Figures 1. 

fcd-trees are usually constructed top-down, beginning with the full set of 
points and then splitting in the center of the widest dimension. This produces 
two child nodes, each with a distinct set of points. This procedure is then 
repeated recursively on each of the two child nodes. 



x.j G ji.Left Xjjn.SPLrrDiM] < n.SPLiTVAniE and Xj G n 
Xj G n. Right Xj[n. Split-Dim] > ti.SplitValue and X,£n 
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Figure la: The top node of a kdtree is sim- 
ply a hyper-rectangle surrounding the data- 



Figure lb: The second level contains two 
nodes. 



Figure lc: The third level contains four 

nodes. Note how a parent node creates its Figure Id: The set of nodes in the sixth level 
two children by splitting in the centers of its of the tree, 
widest dimension 



A node is declared to be a leaf, and is left unsplit, if the widest dimension 
of its bounding box is < some threshold, MinBox Width. A node is also left 
unsplit if it denotes fewer than some threshold number of points, r m ; n . A leaf 
node has no children, but instead contains a list of fc-dimensional vectors: 
the actual datapoints contained in that leaf. The values MinBoxWidth = 
and r min = 1 would cause the largest M-tree structure because all leaf nodes 
would denote singleton or coincident points. In practice, we set MinBoxWidth 
to 1% of the range of the data point components and r min to around 10. The 
tree size and construction thus cost considerably less than these bounds be- 
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cause in dense regions, tiny leaf nodes are able to summarize dozens of data 
points. The operations needed in tree-building are computationally trivial 
and therefore, the overhead in constructing the tree is negligible. Also, once 
a tree is built it can be re- used for many different analysis operations. 

Since the introduction of Ad-trees, many variations of them have been 
proposed and used with great success in areas such as databases and com- 
putational geometry (Preparata & Shamos 1985). R-trees (Guttman 1984) 
are designed for disk resident data sets and efficient incremental addition of 
data. Metric trees (see Uhlmann 1991) place hyperspheres around tree nodes, 
instead of axis-aligned splitting planes. In all cases, the algorithms we discuss 
in this paper could be applied equally effectively with these other structures. 
For example, Moore (2000) shows the use of metric trees for accelerating 
several clustering and pairwise comparision algorithms. 

3 Range Searching 

Before proceeding to fast iV-point calculations, we will begin with a very 
standard M-tree search algorithm that could be used as a building block for 
fast 2-point computations. 

For simplicity of exposition we will assume the every node of the fcd-tree 
contains one extra piece of information: the bounding box of all the points 
it contains. Call this box ti.boundBox. The implication of this is that every 
node must contain two new k dimensional vectors to represent the lower 
and upper limits of each dimension of the bounding box. The range search 
operation takes two inputs. The first is a /c-dimensional vector q called the 
query point. The second is a separation distance Shi- The operation returns 
the complete set of points in the H-tree that lie within distance Shi of q. 

• RangeSearch(n, q, Shi) 

Returns a set of points S such that 

xeS«x£n and |x — q| < Shi (3) 

• Let MinDist :— the closest distance from q to 7j.boundBox. 

• If MinDist > s^i then it is impossible that any point in n can be within 
range of the query. So simply return the empty set of points without 
doing any further work. 

• Else, if n is a leaf node, we must iterate through all the datapoints in its 
leaf list. For each point, find if it is within distance sm of q. If so, add it 
to the set of results. 

• Else, n is not a leaf node. Then: 

— Let SWt := RangeSearch(n.LEFT, q, s^i) 

— Let SVight ■— RangeSearch(n. right, q, su) 

— Return SWt U Slight- 
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Figure 2a: The shaded rectangles denote 

Figure 2b: When the range is larger there are 

nodes that were pruned during a search for 

fewer opportunities for pruning. 

the set of points that lie inside the circle. 



Figure 2a shows the result of running this algorithm in two dimensions. 
Many large nodes are pruned from the search. 117 distance calculations were 
needed for performing this range search, compared with 499 that would have 
been needed by a naive method. 

Note that it is not essential that M-tree nodes have bounding boxes ex- 
plicitly stored. Instead a hyper-rectangle can be passed to each recursive call 
of the above function and dynamically updated as the tree is searched. 

Range searching with a fcd-tree can be much faster than without if the 
range is small, containing only a small fraction of the total number of data- 
points. But what if the range is large? Figure 2b shows an example in which 
M-trees provide little computational saving because almost all the points 
match the query and thus need to be visited. In general this problem is un- 
avoidable. But in one special case it can be avoided — if we merely want to 
count the number of datapoints in a range instead of explicitly find them all. 

3.1 Range Counting and Cached Sufficient Statistics 

We will add the following field to a fcd-tree node. Let n.NuMPoiNTs be the 
number of points contained in node n. This is the first and simplest of a set 
of Ad-tree decorations we refer to as cached sufficient statistics (see Moore 
& Lee 1998). In general, we frequently stored the centroid of all points in a 
node and their covariance matrix. 

Once we have n.NuMPoiNTs it is trivial to write an operation that counts 
the number of datapoints within some range without explicitly visiting them. 
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Fig. 3. When doing 
a range count, nodes 
entirely within range 
can also be pruned 
and added to the 
total count. This ad- 
ditional pruning adds 
significant speed-ups 
to the slower range 
count discussed in 
Figure 2b. Now one 
just spends time 
studying nodes on 
the boundary of the 
range count. 



• Range Count( ?i, q, Shi) 

Returns an integer: the number of points that are both inside the n and 
also within distance Shi of q. 

• Let MinDist := the closest distance from q to ti.boundBox. 

• If MinDist > Shi then it is impossible that any point in n can be within 
range of the query. So simply return 0. 

• Let MaxDist := the furthest distance from q to ti.boundBox. 

• If MaxDist < Shi then every point in n must be within range of the 
query So simply return ti.numPoints. 

• Else, if n is a leaf node, we must iterate through all the datapoints in 
its leaf list. Start a counter at zero. For each point, find if it is within 
distance Shi of q. If so, increment the counter by one. Return the count 
once the full list has been scanned. 

• Else, n is not a leaf node. Then: 

— Let Cieft := RangeCount(n. left, query, Shi) 

— Let C r ight := RangeCount(n. right, query, s/i,) 

— Return CW t + C ri ght- 

The same query that gave the poor range search performance in Figure 
2b gives good performance in Figure 3. The difference is that a second type 
of pruning of the search is possible: if the hyperrectangle surrounding the n 
is either entirely outside or inside the range then we prune. 
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4 Fast N— point Correlation Functions 

4.1 The Single Tree Approach to Two- Point Computation 

It is easy to see that the 2-point correlation function is simply a repeated set 
of range counts. For example, given a minimum and maximum separation si 
and Shi we run the following algorithm: 

• SingleTree2Point(X, n, s; , Shi) 

Input X is a dataset, represented as a matrix in which the fcth row corre- 
sponds to the fcth datapoint. X has R rows and k columns. Input n is the 
root of a kdtree built from the data in X. Output integer: the number of 
pairs of points (xj,Xj) such that si < |xj — Xj| < Shi- 

• C := 

• For i between 1 and R do: 

— C :— C + RangeCount(n, Xj, sm) — RangeCount(n, x i5 s; Q ) 

Note that in practice we do not use two range counts at each iteration, but 
one slightly more complex rangecount operation 

RangeCountBetweenSeparations(n, q, s; Q , Shi) 

that directly counts the number of points whose distance from q is between 
sio and Shi- 

4.2 The Dual Tree Approach to Two-Point Computation 

The previous algorithm iterates over all datapoints, issuing a range count 
operation for each. We can save further time by turning that outer iteration 
into an additional kd-tree search. The new search will be a recursive procedure 
that takes two nodes, n a and rib, as arguments. The goal will be to compute 
the number of pairs of points (x,y) such that x e n a , y e rib, and si D < 
|x - y| < Shi- 

• DualTreeCount(n a , rib, s io, Shi) 

Returns an integer: the number of pairs of points (x, y) such that x G n a , 
y e rib, and s to < |x - y| < s hi . 

• Let MinDist := the closest distance between n a .BouNDBox and rib-BouNDBox. 

• If MinDist > Shi then it is impossible that any pair of points can match. 
So simply return 0. 

• Let MaxDist := the furthest distance between n a .BouNDBox and rib-BouNDBox. 

• If MaxDist < si D then it is again impossible that any pair of points can 
match. So simply return 0. 

• If si < MinDist < MaxDist < sm then all pairs of points must match. 
Use n a .NuMPoiNTs and rib-NuMPoiNTs to compute the number of resulting 
pairs n a .NuMPoiNTs x rib-NuMPoiNTs, and return that value. 
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• Else, if n a and n\, are both leaf nodes, we must iterate through all pairs 
of datapoints in their leaf lists. Return the resulting (slowly computed) 
count. 

• Else at least one of the two nodes is a non-leaf. Pick the non-leaf with the 
largest number of points (breaking ties arbitrarily), and call it n*. Call 
the other node n~ . Then: 

— Let Ci c ft := DualTreeCount(n*. left, n~, s; , s;,,i) 

— Let CVight := DualTreeCount(n*. right, n _ , si Q , su) 

— Return C lcft + C right . 

Computing a 2-point function on a dataset X then simply consists of com- 
puting the value C = DualTreeCount(n root , n roo t> si , Shi), where n root is 
a kd-tree built from X, for a range of bins with minimum and maximum 
boundaries of si and hisep. We note here that the 2-point correlation func- 
tion, the quanity of interest is not simply C, but C/2 (the number of unique 
pairs of objects). 

A further speed-up can be obtained by simultaneously computing the 
DualTreeCount(n root , n root , s; Q , Shi) over a series of bins. We will discuss 
this in further detail in Connolly et al. (2001). 



4.3 Redundancy Elimination 

So far, we have discussed two operations - exclusion and subsumption - which 
remove the need to traverse the whole tree thus speeding-up the computation 
of the correlation function. Another form of pruning is to eliminate node- node 
comparisons which have been performed already in the reverse order. This 
can be done simply by (virtually) ranking the datapoints according to their 
position in a depth-first traversal of the tree, then recording for each node the 
minimum and maximum ranks of the points it owns, and pruning whenever 
n a 's maximum rank is less than rib's minimum rank. This is useful for all- 
pairs problems, but will later be seen to be essential for all-k-tuples problems. 
This kind of pruning is not practical for Single-tree search. 



4.4 Multiple Trees Approach to TV Point Computation 

The advantages of Dual- Tree over Single- Tree are so far two fold. First, Dual- 
tree can be faster, and second it can exploit redundancy elimination. But two 
more advantages remain. First, we can extend the "2-tree for 2-point" method 



up to "N-trees for N-point" . Second (discussed in Section 5.1), we can perform 
effective approximation with Dual-trees (or n-trees) . We now discuss the first 
of these advantages. 

The A-point computation is parameterized by two n x n symmetric ma- 
trices: L and H. We wish to compute 

R R R 

E E ••■ E I(L,H,h,i2,...i n ) (4) 

l 1 =l 12 — 1 in — 1 
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where I(L, H, i\, i 2 , . . . i n ) is zero unless the following conditions hold (in 
which case it takes the value 1): 

VI < i < j < n, L {h]) < \x u - x 4j I < H {ltj) (5) 

We will achieve this by calling a recursive function FastNPoint on an n-tuplc 
of kdtree nodes (m, n 2 ■ ■ ■ n n ). This recursive function much return 

E E •■• E I{L,H,i u i 2 ,...i n ) (6) 

• FastNPoint(n a , n h ,s lo ,s hi ) 

• Let AllSubsumed:=TRUE 

• For i = 1 to n do 

— For j = i + 1 to n do 

* Let MinDist := the closest distance between 774 .boundBox and 

Tlj .BoundBox. 

* If MinDist > Hu^ then it is impossible that any n-tuple of 
points can match because the distance between the ith. and jth 
points in any such n-tuple must be out of range. So simply return 
0. 

* Let MaxDist := the furthest distance between vij .BoundBox and 
rij .BoundBox. 

* If MaxDist < Luj\ then similarly return 0. 

* If < MinDist < MaxDist < then every n-tuplc 
has the property the the ith member and jth member match. 
We are interested in whether this is true for all pairs and 
so the first time we are disappointed (by discovering the above 
expression does not hold) then we will update the AllSubsumed 
flag. Thus the actual computation at this step is: 

If > MinDist or MaxDist > H^ ^ then 
AllSubsumed : =FALSE. 

• If AllSubsumed has remained true throughout the above double loop, 
we can be sure that every n-tuple derived from the nodes in the recursive 
call must match, and so we can simply return 

n 

J^J Tli.NUMPOINTS (7) 

1=1 

• Else, if all of ni,n 2 ,...n n are leaf nodes we must iterate through all 
n-tuples of datapoints in their leaf lists. Return the resulting (slowly 
computed) count. 

• Else at least one of the nodes is a non leaf. Pick the non-leaf with the 
largest number of points (breaking ties arbitrarily), and assume it has 
index i = i* . Then: 
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— Let Cioft := FastNPoint (ni, . . . , ?v -Left, . . . , n n ) 

— Let CVight := FastNPoint(ni, . . . , ?v .right, . . . , n n ) 

— Return C lcft + C rig ht- 

The full A-point computation is achieved by calling FastNPoint with 
arguments consisting of an n-tuple of copies of the root node. 

We should note once again it is possible to save considerable amounts of 
computation by eliminating redundancy. For example, in the 4-point statis- 
tic, the above implementation will recount each matching 4-tuplc of points 
(x,y,z,w) in 24 different ways: once for each of the 4! permutations of 
(x, y, z, w). Again, this excess cost can be avoided by ordering the datapoints 
via a depth-first tree indexing scheme and then pruning any n-tuple of nodes 
violating that order. But the reader should be aware of an extremely messy 
problem regarding how much to award to the count in the case that a sub- 
sume type of pruning can take place. If all nodes own independent sets of 
points the answer is simple: the product of the node counts. If all nodes are 
the same then the answer is again simple: (^) , where n is the number of 
points in the node. Somewhat more subtle combinatorics are needed in the 
case where some nodes in the n-tuple are identical and others are not. And 
fearsome computation is needed in the various cases in which some nodes are 
descendants of some other nodes. 

5 Controlled Approximation 

In general, when the final answer comes back from FastNPoint, the ma- 
jority of the quantity in the count will be the sum of components arising 
from large subsume prunes. But the majority of the computational effort will 
have been spent on accounting for the vast number of small but unprunable 
combinations of nodes. We can improve the running time of the algorithm by 
demanding that it also prunes it search in cases in which only a tiny count 
of n-tuplcs is at stake. This is achieved by adding a parameter, T, to the 
FastNPoint algorithm, and adding the following lines at the start: 

• Let C max : = Il™ =1 n;.NuMPoiNTs 

• If C max < T then quit this recursive call. 

This will clearly cause an inaccurate result, but fortunately it is not hard to 
maintain tight lower and upper bounds on what the true answer would have 
been if the approximation had not been made. Thus FastNPoint (n\, n 2 , ■ ■ ■ ,n n 
now returns a pair of counts (C , Chi) where we can guarantee that the true 
count C lies in the range C\ Q < C < Chi- 

5.1 Iterative Deepening for Controlled Approximation 

Suppose the true value of the TV-point function is C but that we are prepared 
to accept a fractional error of e: we will be happy with any value C approx such 
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Fig. 4. The com- 
putational time of 
our algorithm versus 
the size of dataset. 
The crosses are real 
2-dimensional pro- 
jected galaxy data 
while the dots are 
just drawn from a 
Poission distribution. 
The theoretically 
expected scaling law 
of Ny/N is shown 
and agrees well with 
the observed data. 
The naive iV 2 law 
is also plotted for 
comparison 

that 

|Capprox — C\ < eC (8) 

It is possible to adapt the n-tree algorithm using a best-first iterative deep- 
ening search strategy to guarantee this result while exploiting permission 
to approximate effectively by building the count as much as possible from 
"easy-win" node pairs while doing approximation at hard deep node-pairs. 
This is simply achieved by repeatedly calling the previous approximate algo- 
rithm with diminishing values of T until a value is discovered that satisfies 
Equation ^. 

6 Discussion 

We plan to present a more detailed discussion of the techniques presented 
here in a forthcoming paper (Connolly et al. 2001). That paper will also in- 
clude a full analysis of the computational speed and overhead of our iV-point 
correlation function algorithm and compare those with existing software for 
computing the higher-order correlation functions e.g. Szapudi et al. 1999a. 
However, in Figure ^, we present preliminary results on the scaling of com- 
putational timing needed for a 2-point correlation function as a function 
of the number of objects in the data set. For these tests, we computed all 
the data-data pairs for random data sets and real, projected 2-dimensional 
galaxy data. These data show that our 2-point correlation function algorithm 
scales as 0(N y/~N) (for projected 2-dimensional data) compared to the naive 
all-pairs scaling of 0(N 2 ) where here N is the size of the dataset under con- 
sideration. To emphasis the speed-up obtained by our algorithm (Figure [IJ), 
an all-pairs count for a database of 10 7 objects would take only 10 hours (on 
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our DEC Alpha workstation) using our methodology compared to ~ 10, 000 
hours (> 1 year) using the naive N 2 method. Clearly, binning the data would 
also drastically increase the speed of analyses over the naive all-pairs 0(N 2 ) 
scaling but at the price of lossing of resolution. 

Similar spectacular speed-ups will be achieved for the 3 and 4-point func- 
tions and we will report these results elsewhere (Connolly et al. 2001). Fur- 
thermore, controlled approximations can further accelerate the computations 
by several orders of magnitude. Such speed-ups are vital to allow Monte Carlo 
estimates of the errors on these measurements. In summary, our algorithm 
now makes it possible to compute an exact, all-pairs, measurement of the 2, 
3 and 4-point correlation functions for data sets like the Sloan Digital Sky 
Survey (SDSS). These algorithms will also help in the speed-up of Cosmic 
Microwave Background analyses as outlined in Szapudi et al. (2000). 

Finally, we note here that we have only touched upon one aspect of how 
trees data structures (and other computer science techniques) can help in 
the analysis of large astrophysical data sets. Moreover, there are other tree 
structures beyond M-trees such as ball trees which could be used to optimize 
our correlation function codes for higher dimensionality data. We will explore 
these issues in future papers. 
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